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Abstract 

This paper shows how a folded Markov chain network can be apphed 
to the problem of processing data from multiple sensors, with an empha- 
sis on the special case of 2 sensors. It is necessary to design the network 
so that it can transform a high dimensional input vector into a posterior 
probability, for which purpose the partitioned mixture distribution net- 
work is ideally suited. The underlying theory is presented in detail, and a 
simple numerical simulation is given that shows the emergence of ocular 
dominance stripes. 



1 Theory 

1.1 Neural Network Model 

In order to fix ideas, it is useful to give an explicit "neural network" interpreta- 
tion to the theory that will be developed. The model will consist of 2 layers of 
nodes. The input layer has a "pattern of activity" that represents the compo- 
nents of the input vector x, and the output layer has a pattern of activity that 
is the collection of activities of each output node. The activities in the output 
layer depend only on the activities in the input layer. If an input vector x is 
presented to this network, then each output node "fires" discretely at a rate that 
corresponds to its activity. After n nodes have fired the probabilistic descrip- 
tion of the relationship between the input and output of the network is given by 
Pr (yij 7 ■ • • , yn|x), where is the location in the output layer (assumed to be 
on a rectangular lattice of size m) of the i^^ node that fires. In this paper it will 
be assumed that the order in which the n nodes fire is not observed, in which 
case Pr(yi,y2, • • • ,yn|x) is a sum of probabilities over all n\ permutations of 
(yii y27 • • • , yn), which is a symmetric function of the y^, by construction. 

The theory that is introduced in section FOl concerns the special case n = 1. 
In the n — 1 case the probabilistic description Pr (y|x) is proportional to the 
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firing rate of node y in response to input x. When n > 1 there is an indirect 
relationship between the probabilistic description Pr (yi, y2, • • • , yn |x) and the 
firing rate of node y, which is given by the marginal probability 

in 

Pr(y|x)= Pr(y,y2,---,yn|x) (1) 

It is important to maintain this distinction between events that are observed 
(i.e. (yi, y2, • • • , yn) given x) and the probabilistic description of the events that 
are observed (i.e. Pr(yi,y2, • • • ,y„|x)). The only possible exception is in the 
n — >■ cx) limit, where Pr (yi, y2, • • • , yn|x) has all of its probability concentrated 
in the vicinity of those (yi,y2, • • • ,yn) that are consistent with the observed 
long-term average firing rate of each node. It is essential to consider the n > 1 
case to obtain the results that are described in this paper. 



1.2 Probabilistic Encoder/Decoder 

A theory of self-organising networks based on an analysis of a probabilistic 
encoder /decoder was presented in [Tj. It deals with the n = 1 case referred to 
in section 11.11 The objective function that needs to be minimised in order to 
optimise a network in this theory is the Euclidean distortion D defined as 

m „ 

D=Yj <ixfix'Pr(x)Pr(y|x)Pr(x'|y) l|x-x'f (2) 

where x is an input vector, y is a coded version of x (a vector index on a d- 
dimensional rectangular lattice of size m) , x' is a reconstructed version of x from 
y, Pr (x) is the probability density of input vectors, Pr (y|x) is a probabilistic 
encoder, and Pr(x'|y) is a probabilistic decoder which is specified by Bayes' 
theorem as 

^^r(yWPr(^ 
^ '^^ /dx'Pr(y|x')Pr(x') ^ ' 

D can be rearranged into the form pj 

m ^ 

D = 2J2 dxPr(x)Pr(y|x)||x-x'(y)f (4) 

where the reference vectors x' (y) are defined as 

x'(y) = ydxPr(x|y) x (5) 

Although equation[5]is symmetric with respect to interchanging the encoder and 
decoder, equation|3]is not. This is because Bayes' theorem has made explicit the 
dependence of Pr (x|y) on Pr (y|x). From a neural network viewpoint Pr (y|x) 
describes the feed-forward transformation from the input layer to the output 
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layer, and x' (y) describes the feed-back transformation that is implied from 
the output layer to the input layer. The feed-back transformation is necessary 
to implement the objective function that has been chosen here. 

Minimisation of D with respect to all free parameters leads to an optimal 
encoder /decoder. In equation S] the Pr(y|x) are the only free parameters, be- 
cause x' (y) is fixed by equation [S] However, in practice, both Pr(y|x) and 
x' (y) may be treated as free parameters [T], because x' (y) satisfy equation [5] 
at stationary points of D with respect to variation of x' (y) . 



1.3 Posterior Probability Model 

The probabilistic encoder/decoder requires an explicit functional form for the 
posterior probability Pr (y|x). A convenient expression is 

T3 ^ I ^ ^(xly) 

where Q (x|y) > can be regarded as a node "activity", and ^ (yl^) = 1- 

Any non-negative function can be used for (5(x|y), such as a sigmoid (which 
satisfies < Q (x|y) < 1) 

1 + exp (-v^f (y) - x-b^y)) 

where w (y) and h (y) are a weight vector and bias, respectively. 

A drawback to the use of equation [6] is that it does not permit it to scale 
well to input vectors that have a large dimensionality. This problem arises from 
the restricted functional form allowed for Q (x|y). A solution was presented in 

m 

y'eAf(y) J c ; 

where M = mim2 • • • md, and N (y) is a set of lattice points that are deemed 
to be "in the neighbourhood of" the lattice point y, and N (y) is the inverse 
neighbourhood defined as the set of lattice points that have lattice point y in 
their neighbourhood. This expression for Pr (y|x) satisfies ^ (y|x) = 1 

(see appendix lAl. It is convenient to define 

p / I _ Q (x|y) 

'^■^^'^^^^=Ey^^..(yOQ(x|y") 

which is another posterior probability, by construction. It includes the effect of 
the output nodes that are in the neighbourhood of node y' only. Pr (y|x; y') is 
thus a localised posterior probability derived from a localised subset of the node 
activities. This allows equation[H]to be written as Pr (y|x) = X]y'G-/v(y) -'^'^ (yl'*^! y')j 
so Pr (y|x) is the average of the posterior probabilities at node y arising from 
each of the localised subsets that happens to include node y. 
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1.4 Multiple Firing Model 



The model may be extended to the case where n output nodes fire. Pr (y|x) is 
then replaced by Pr (yi , y2 , • • • Yn |x) , which is the probability that (yi , y2 , • • • yn) 
are the first n nodes to fire (in that order). With this modification, D becomes 



m „ 

D = 2 ^ / dxPr(x)Pr(yi,y2,---y„|x) ||x-x'(yi,y2,---y„) 
yi,y2,---yn=i 

where the reference vectors x' (yi, y2, • • • yn) a-re defined as 

x' (yi,y2, • • -Yn) = y"dxPr(x|ypy2,---y„) x 



(10) 



(11) 



The dependence of Pr (yi, y2, • • • y„|x) and x' (yi 

J y2j ■ ■ ■ Yn) on n output node 
locations complicates this result. Assume that Pr (yi, y2, • • • y„|x) is a symmet- 
ric function of its (yi, y2, • ' ' Yn) arguments, which corresponds to ignoring the 
order in which the first n nodes choose to fire (i.e. Pr (yi, y2, • • • Yn|x) is a sum 
over all permutations of (yi, Y2, • • • Yn))- For simplicity, assume that the nodes 
fire independently so that Pr(Yi,Y2|x) = Pr (yijx) Pr (y2|x) (see appendix IB] 
for the general case where Pr (yi, Y2 |x) does not factorise). D may be shown to 
satisfy the inequality D < Di + D2 (see appendix [B|) . where 



2 r 

= / dxPr(x)Pr(y|x)||x-x'(y)|| 



Do 



y=i 
2(n- 1) 



y dxPr (x) 



^Pr(Ylx) (x-x'(y)) 
y=i 



(12) 



D\ and D2 are both non-negative. D\ as n — cx), and D2 — when n = 0, 
so the D\ term is the sole contribution to the upper bound when n = 0, and the 
D2 term provides the dominant contribution as n — > cx). The difference between 
the Di and the D2 terms is the location of the P^' (y|x) (• • •) average: in 

the D2 term it averages a vector quantity, whereas in the Di term it averages 
a Euclidean distance. The D2 term will therefore exhibit interference effects, 
whereas the Di term will not. 



1.5 Probability Leakage 

The model may be further extended to the case where the probability that a 
node fires is a weighted average of the underlying probabilities that the nodes 
in its vicinity fire. Thus Pr (y|x) becomes 



m 

Pr(y|x)^ ^ Pr(y|y')Pr(y'|x) 



(13) 



where Pr (y|y') is the conditional probabiUty that node y fires given that node 
y' would have liked to fire. In a sense, Pr(y|y') describes a "leakage" of prob- 
ability from node y' that onto node y. Pr (y|y') then plays the role of a soft 
"neighbourhood function" for node y'. This expression for Pr (y|x) can be used 
wherever a plain Pr (y|x) has been used before. The main purpose of introduc- 
ing leakage is to encourage neighbouring nodes to perform a similar function. 
This occurs because the effect of leakage is to soften the posterior probability 
Pr (y|x), and thus reduce the ability to reconstruct x accurately from knowledge 
of y, which thus increases the average Euclidean distortion D. To reduce the 
damage that leakage causes, the optimisation must ensure that nodes that leak 
probability onto each other have similar properties, so that it does not matter 
much that they leak. 



1.6 The Model 



The focus of this paper is on minimisation of the upper bound Di + D2 (see 
equation [12) to D in the multiple firing model, using a scalable posterior proba- 
bility Pr (y|x) (see equation[8)), with the effect of activity leakage Pr (y|y') taken 
into account (see equation [T5)) . Gathering all of these pieces together yields 

c) n mm 

Di - ;^yrfxPr(x)^^Pr(y|y') ^ Pr (y'|x; y") ||x - x' (y) f 



y=i y 



y"eN(y') 



Do 



2{n-l) 



dx Pr (x) 



(14) 



E 

y"eN(y') 



Pr(y'|x;y")(x-x' (y)) 



where Pr (y|x; y') 



Q(x|y) 



eN(y') 



Q(x|y") 



In order to ensure that the model is truly scalable, it is necessary to restrict 
the dimensionality of the reference vectors. In equation 1 141 dim x' (y) = dimx, 
which is not acceptable in a scalable network. In practice, it will be assumed any 
properties of node y that are vectors in input space will be limited to occupy 
an "input window" of restricted size that is centred on node y. This restriction 
applies to the node reference vector x' (y), which prevents Di + D2 from being 
fully minimised, because x' (y) is allowed to move only in a subspace of the full- 
dimensional input space. However, useful results can nevertheless be obtained, 
so this restriction is acceptable. 



1.7 Optimisation 

Optimisation is achieved by minimising Di + D2 with respect to its free param- 
eters. Thus the derivatives with respect to x' (y) are given by 

dDi 4 



ax' (y) n M 



j dxPr(x) fi(x,y) 
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and the variations with respect to Q (x|y) are given by 

5Di = ^T.Jd^P^ (x) 91 (x, y) <5 log Q (x|y) 

m „ 

^ J dxPr(x) g2(x,y) <51ogQ(x|y) (16) 



The functions fi (x, y), {2 (x, y), gi (x, y), and 32 (x, y) are derived in appendix 
O Inserting a sigmoidal function Q (x|y) = i+cxp(-w(y)-x-6(y)) ^^"^^ yields the 
derivatives with respect to w (y) and b (y) as 

dDi 



I dxPr(x) 31 (x,y) (1 - Q (x|y)) ^ 

I dxPr(x)g2(x,y) (1 - Q (x|y)) ^ ^(17) 



My) 

w(y) 



/ 6(y) \ nAf 

I w(y) ; 

Because all of the properties of node y that are vectors in input space (i.e. x' (y) 
and w(y)) are assumed to be restricted to an input window centred on node 
y, the eventual result of evaluating the right hand sides of the above equations 
must be similarly restricted to the same input window. 



1.8 The Effect of the EucHdean Norm on Minimising Di + 

The expressions for Di and D2, and especially their derivatives, are fairly com- 
plicated, so an intuitive interpretation will now be presented. When Di + D2 is 
stationary with respect to variations of x' (y) it may be written as (see appendix 

D^+D2 = /dxPr(x)E™iPr(yWI|x'(y)f 



2(n-l) 



/dxPr(x) E™iPr(y|x)x'(y) 



(18) 



+ constant 

The M and AP factors do not appear in this expression because Pr(y|x) is 
normalised to sum to unity. The first term (which derives from Di) is an 
incoherent sum (i.e. a sum of Euclidean distances), whereas the second term 
(which derives from D2) is a coherent sum (i.e. a sum of vectors). The first 
term contributes for all values of n, whereas the second term contributes only 
for ri > 2, and dominates for n ^ 1. In order to minimise the first term the 
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||x' (y) 11^ like to be as large as possible for those nodes that have a large Pr (y |x) . 
Since x' (y) is the centroid of the probability density Pr (x|y), this implies that 
node y prefers to encode a region of input space that is as far as possible from the 
origin. This is a consequence of using a Euclidean distortion measure ||x — x'||^, 
which has the dimensions of ||x||^, in the original definition of the distortion in 
equation [51 In order to minimise the second term the superposition of x' (y) 
weighted by the Pr(y|x) likes to have as large a Euclidean norm as possible. 
Thus the nodes co-operate amongst themselves to ensure that the nodes that 

2 

have a large Pr (y|x) also have a large Y^^=i (y|^) ^' (y) 

2 Solvable Analytic Model 

The purpose of this section is to work through a case study in order to demon- 
strate the various properties that emerge when Di + D2 is minimised. 

2.1 The Model 

It convenient to begin by ignoring the effects of leakage Pr (y|y'), and to con- 
centrate on a simple (non-scaling) version of the posterior probability model 

(as in equation in]) Pr(y|x) — ^J^^^^^^ , where the Q (x|y) are threshold 

z^y'=i Q'-^\y'> 

functions of x 

„ , I N _ / below threshold . 
Q Wy) = I ^ j^bove threshold ^ ' 

It is also convenient to imagine that a hypothetical infinite-sized training set 
is available, so it may be described by a probability density Pr(x). This is a 
"frequentist", rather than a "Bayesian", use of the Pr (x) notation, but the dis- 
tinction is not important in the context of this paper. Assume that x = (xi, X2) 
is drawn from a training set, that has 2 statistically independent subspaces, so 
that 

Pr(xi,X2) = Pr(xi)Pr(x2) (20) 
Furthermore, assume that Pr (xi) and Pr (X2) each have the form 

Pr(x,) = — / d^,;(5(x, -x,(0O) (21) 
Jo 

i.e. Pr (xi) is a loop (parameterised by a phase angle 9i) of probability density 
that sits in Xj-space. In order to make it easy to deduce the optimum reference 
vectors, choose x^ {9i) so that the following 2 conditions are satisfied for i = 1,2 



x,;(0,:)|| = constant 

= constant (22) 



This type of training set can be visualised topologically. Each training vector 
(xi , X2 ) consists of 2 subvectors, each of which is parameterised by a phase angle. 
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Figure 1: Representation of x topology with a threshold Q {x\y) super- 
imposed. 



and which therefore lives in a subspace that has the topology of a circle, which 
is denoted as S^. Because of the independence assumption in equation [20l the 
pair (xi, X2) lives on the surface of a 2-torus, which is denoted as x S^. The 
minimisation of Di + D2 thus reduces to finding the optimum way of designing 
an encoder/decoder for input vectors that live on a 2-torus, with the proviso 
that their probability density is uniform (this follows from equation [5T] and 
equation . In order to derive the reference vectors x' (y), the solution(s) of 
the stationarity condition ^^Q^^^-j^'^ = must be computed. The stationarity 
condition reduces to (see appendix [P]) 



n J dxi dx2Pr(xi,X2|y) (xi,X2) 

= - 1) Ey=i (/ rfxi (ix2 Pr (xi, X2|y) Pr (y'|xi, X2)) (x^ {y') , x^ (y')) 
+ (xl(y),x^ (y)) 

(23) 

It is useful to use the simple diagrammatic notation shown in figured?!] Each 

circle in figure \TJ\ represents one of the subspaces, so the two circles together 

represent the product x S^. The constraints in equation [52] are represented by 

each circle being centred on the origin of its subspace (||xi (0i)ll^ constant), 

2 



is 



and the probability density around each circle being constant ( 

constant). A single threshold function Q (x|y) is represented by a chord cut- 
ting through each circle (with and 1 indicating on which side of the chord 
the threshold is triggered). The x^ that lie above threshold in each subspace 
are highlighted. Both xi and X2 must lie above threshold in order to ensure 
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Figure 2: Explicit representation of x topology as a torus with the effect 
of 3 different types of threshold Q {x\y) shown. 

Q (x|y) = 1, i.e. they must both lie within regions that are highlighted in figure 
12.11 In this case node y will be said to be "attached" to both subspace 1 and 
subspace 2. A special case arises when the chord in one of the subspaces (say 
it is X2) does not intersect the circle at all, and the circle lies on the side of the 
chord where the threshold is triggered. In this case Q (x|y) does not depend on 
X2, so that Pr(y|x-^,X2) = Pr(y|xj^), in which case node y will be said to be 
"attached" to subspace 1 but "detached" from subspace 2. The typical ways in 
which a node becomes attached to the 2-torus are shown in figure \TJ\ In figure 
I2.ir a) the node is attached to one of the subspaces and detached from the 
other. In figure 12. K b) the attached and detached subspaces are interchanged 
with respect to figure EUJa) . In figure I2.1f c) the node is attached to both 
subspaces. 

2.2 All Nodes Attached to One Subspace 

Consider the configuration of threshold functions shown in figure 12.21 This is 
equivalent to all of the nodes being attached to loops to cover the 2-torus, with 
a typical node being as shown in figure [2TT a) (or, equivalently, figure ITlT b)). 
When Di + D2 is minimised, it is assumed that the 4 nodes are symmetrically 
disposed in subspace 1, as shown. Each is triggered if and only if Xi lies within 
its quadrant, and one such quadrant is highlighted in figure [521 This implies 
that only 1 node is triggered at a time. The assumed form of the threshold 
functions implies Pr(y|x-^,X2) = Pr(2/|x-^), so equation [23l reduces to 

n J dxi(ix2Pr(xi|y)Pr(x2) (xi,X2) 



/ dxi dx2 Pr (xi \y) Pr (X2) 



(24) 



X {{n 1) E^Li Pr (y' |xi) (x'l (y') , x^ (y')) + K iv) , A iv))) 



whence 



x'l (y) 
^2 (y) 








(25) 
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Figure 3: 16 nodes are shown, which are all attached to subspace 1, and all 
detached from subspace 2. 



2.3 All Nodes Attached to Both Subspaces 

Consider the configuration of threshold functions shown in figure 12.31 This is 
equivalent to all of the nodes being attached to patches to cover the 2-torus, 
with a typical node being as shown in figure l^TIT c). In this case, when D1 + D2 is 
minimised, it is assumed that each subspace is split into 2 halves. This requires 
a total of 4 nodes, each of which is triggered if, and only if, both xi and X2 lie 
on the corresponding half-circles. This implies that only 1 node is triggered at a 
time. The assumed form of the threshold functions implies that the stationarity 
condition becomes 



n Pr(j/) / dxidx2Pr(xi,X2|y) (xi,X2) 
= Pr (y) / dxi dx2 Pr (xi , X2 1 y) 

X {{n - 1) EyU (j/'lxi, X2) (xi iy') , x'2 (y')) + (x'^ (y) , x^ (y))) 



(26) 



whence 



x'l (y) = J dxiPr(xi|?/)xi 
X2 iy) = J dx2 Pr (x2 |y) X2 



(27) 



2.4 Half the Nodes Attached to One Subspace, and Half 
to the Other Subspace 

Consider the configuration of threshold functions shown in figure 12.41 This is 
equivalent to half of the nodes being attached to loops to cover the 2-torus, 
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1 



2 



Figure 4: 16 nodes are shown, which are all attached to both subspace 1 and 
subspace 2. 



with a typical node being as shown in figure lOT a). The other half of the nodes 
would then be attached in an analogous way, but as shown in figure I2.1f b) . 
Thus the 2-torus is covered twice over. In this case, when Di + D2 is minimised, 
it is assumed that each subspace is split into 2 halves. This requires a total 
of 4 nodes, each of which is triggered if Xi (or X2) lies on the half-circle in 
the subspace to which the node is attached. Thus exactly 2 nodes yi (xi) and 
2/2 (X2) are triggered at a time, so that 



For simplicity, assume that node y is attached to subspace 1, then Pr (xi, X2|y) = 
Pr (xily) Pr (X2) and the stationarity condition becomes 



Pr(y|xi,X2) = - ((5y,yi(xi) +'5.y,y2(x2)) 

= i(Pr(y|xi)+Pr(y|x2)) 



(28) 



n Pr(?;) / dxi dx2 Pr (xi|?/) Pr (X2) (xi,X2) 



Pr (y) / dxi dx2 Pr (xi \y) Pr (X2) 



(29) 




^ EyU (Pr (y'lxi) + Pr (2/'|x2)) (x', {y') , x^ {y')) 
+ (xi(j;),x'2(y)) 
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Figure 5: 16 nodes are shown, 8 of which are attached to subspace 1 and de- 
tached from subspace 2 (top row), and 8 of which are attached to subspace 2 
and detached from subspace 1 (bottom row). 
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This may be simplified to yield 

n J (ixiPr(xily) (xi,0) 



= ^(xi(y),x'2(y)) 



+ n_i J pr (X2) J:^^^ Pr (y'|x2) (xi (y') , x'^ (y')) 
= (x'l (y) , x'2 (y)) + ^ (x'l (y) , x'^ (y)}. 



(30) 



Write the 2 subspaces separately (remember that node y is assumed to be at- 
tached to subspace 1) 



x'l (y) 



2n 



dxi Pr (xi|t/) xi 



1 



n + l 



(x'l (y)). 



ft — 1 

A{y) = (x'2 (2/))2 



(31) 



If this result is simultaneously solved with the analogous result for node y at- 
tached to subspace 2, then the (• ■ •) terms vanish to yield 



x'l {y) 
x'2 iy) 



J dxi Pr (xi|y) xi y attached to subspace 1 

y attached to subspace 2 

y attached to subspace 1 

/ (ix2 Pr (x2|y) X2 y attached to subspace 2 



(32) 



2.5 Compare Di + D2 for the 3 Different Types of Solution 



Consider the left hand side of figure [2?2] for the case of M nodes, when the M 
threshold functions form a regular Af-ogon. Pr(x|y) then denotes the part of 
the circle that is associated with node y, whose radius of gyration squared is 
given by (assuming that the circle has unit radius) 



Rm — 



J rfxPr (x|y) X 



M 
2^ 



d9 cos 9 



M 27r 
— sm — 
27r M 



(33) 



Gather the results for (x'^ (y) ,x'2 (y)) in equations [25l (referred to as type 1). [27l 
(referred to as type 2), and [32] (referred to as type 3) together and insert them 
into Di + D2 in equation [18] to obtain (see appendix IE| 



D1+D2 



constant — 2Rm 
constant — 4i? 
constant %r 

n+l 



M 

R A. 



type 1 
type 2 
type 3 



(34) 
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5 10 15 20 25 30 35 40 



Figure 6: Plots of —Di — D2 for n = 1 for each of the 3 types of optnnum. 

In figure [275] the 3 solutions are plotted for the case n = 1. For n = 1 the type 
3 solution is never optimal, the type 1 solution is optimal for M < 19, and the 
type 2 solution is optimal for M > 20. This behaviour is intuitively sensible, 
because a larger number of nodes is required to cover a 2-torus as shown in 
figure I^TlT c) than as shown in figure I^TlT a) (or figure I^TlT b)). 

In figure [2751 the 3 solutions are plotted for the case n — 2. For n — 2 the 
type 1 solution is optimal for M < 12, and the type 2 solution is optimal for 
large M > 30, but there is now an intermediate region 12 < M < 29 (type 1 and 
type 3 have an equal Di + D2 at AI ~ 12) where the n-dependence of the type 
3 solution has now made it optimal. Again, this behaviour is intuitively reason- 
able, because the type 3 solution requires at least 2 observations in order to be 
able to yield a small Euclidean resonstruction error in each of the 2 subspaces, 
i.e. for n = 2 the 2 nodes that fire must be attached to different subspaces. 
Note that in the type 3 solution the nodes that fire are not guaranteed to be 
attached to different subspaces. In the type 3 solution there is a probability 
l^rn^n^ that rii (where n — rii + nodes are attached to subspace i, so the 
trend is for the type 3 solution to become more favoured as n is increased. 

In figure [^75] the 3 solutions are plotted for the case n — 00. For n — 00 
the type 2 solution is never optimal , the type 1 solution is optimal for M < 8, 
and the type 3 solution is optimal for M > 8. The type 2 solution approaches 
the type 3 solution from below asymptotically as M — > 00. In figure [^751 a phase 
diagram is given which shows how the relative stability of the 3 types of solution 
for different M and n, where the type 3 solution is seen to be optimal over a large 
part of the (M, n) plane. Thus the most interesting, and commonly occurring, 
solution is the one in which half the nodes are attached to one subspace and 
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16 



half to the other subspace (i.e. solution type 3). Although this result has been 
derived using the non-scaling version of the posterior probability model Pr (y|x) 
(as in equation[B]) , it may also be used for scaling posterior probabilities (as given 
in equation |5]) in certain limiting cases, and also for cases where the effect of 



2.6 Various Extensions 

2.6.1 The Effect of Leakage 

The effect of leakage will not be analysed in detail here. However, its effect 
may readily be discussed phenomenologically, because the optimisation acts to 
minimise the damaging effect of leakage on the posterior probability by ensuring 
that the properties of nodes that are connected by leakage are similar. This 
has the most dramatic effect on the type 3 solution, where the way in which 
the nodes are partitioned into 2 halves must be very carefully chosen in order 
to minimise the damage due to leakage. If the leakage is presumed to be a 
local function, so that Pr {y\y') — tt (y — y'), which is a localised "blob"-shaped 
function, then the properties of adjacent node are similar (after optimisation). 
Since nodes that are attached to 2 different subspaces necessarily have very 
different properties, whereas nodes that are attached to the same subspace can 
have similar properties, it follows that the nodes must split into 2 continguous 
halves, where nodes 1,2, are attached to subspace 1 and nodes ^ + 

1, + 2, • • • , M are attached to subspace 2, or vice versa. The effect of leakage 
is thereby minimised, with the worst effect occurring at the boundary between 
the 2 halves of nodes. 

2.6.2 Modifying the Posterior Probability to Become Scalable 

The above analysis has focussed on the non-scaling version of the posterior 
probability, in which all M nodes act together as a unit. The more general 
scaling case where the M nodes are split up by the effect of the neighbourhood 
function N (y) will not be analysed in detail, because many of its properties 
are essentially the same as in the non-scaling case. For simplicity assume that 
the neighbourhood function N (y) is a "top-hat" with width w (an odd inte- 
ger) centred on y. Impose periodic boundary conditions so that the inverse 
neighbourhood function N (y) is also a top-hat, N (y) = N (y). In this case an 
optimum solution in the non-scaling case (with M = w) can be directly related 
to a corresponding optimum solution in the scaling case by simply repeating 
the node properties periodically every w nodes. Strictly speaking, higher order 
periodicities can also occur in the scaling case (and can be favoured under cer- 
tain conditions), where the period is ^ (k is an integer), but these will not be 
discussed here. 

The effect of the periodic replication of node properties is interesting. The 
type 3 solution (with leakage and with AI = w) splits the nodes into 2 halves, 
where nodes 1, 2, • • • , ^ are attached to subspace 1 and nodes ^ + l,^+2,---,z/; 
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are attached to subspace 2, or vice versa. When this is rephcated periodically 
every w nodes it produces an alternating structure of node properties, where ^ 
nodes are attached to subspace 1, then the next ^ nodes are attached to sub- 
space 2, and thenthe next ^ nodes are attached to subspace 1, and so on. This 
behaviour is reminiscent of the so-called "dominance stripes" that are observed 
in the mammalian visual cortex. 

3 Experiment 

The purpose of this section is to demonstrate the emergence of the dominance 
stripes in numerical simulations. The main body of the software is concerned 
with evaluating the derivatives of Di -f D2, and the main difficulty is choosing 
an appropriate form for the leakage (this has not yet been automated). 

3.1 The Parameters 

The parameters that are required for a simulation are as follows: 

1. (mi,m2): size of 2D rectangular array of nodes. M — mim2. 

2. (ii, 12): size of 2D rectangular input window for each node (odd integers). 
Ensure that the input window is not too many input data "correlation 
areas" in size, otherwise dominance stripes may not emerge. Dominance 
stripes require that the correlation within an input window are substan- 
tially stronger than the correlations between input windows that are at- 
tached to different subspaces. 

3. (wi,W2)- size of 2D rectangular neighbourhood window for each node 
(odd integers). The neighbourhood function N {yi,y2) is a rectangular 
top-hat centred on (j/i, 1/2) ■ The size of the neighbourhood window has to 
lie within a limited range to ensure that dominance stripes are produced. 
This corresponds to ensuring that M lies in the type 3 region of the phase 
diagram in figure [^751 It is also preferable for the size of the neighbourhood 
window to be substantially smaller than the input window, otherwise dif- 
ferent parts of a neighbourhood window will see different parts of the input 
data, which will make the network behaviour more difficult to interpret. 

4. (^1, Z2): size of 2D rectangular leakage window for each node (odd integers). 
For simplicity the leakage Pr (y|y ) is assumed to be given by Pr (y|y ) — 
TT (y — y'), where tt (y — y') is a "top-hat" function of y — y' which covers 
a rectangular region of size (^1,^2) centred on y — y' = 0. The size of 
the leakage window must be large enough to correlate the parameters of 
adjacent nodes, but not so large that it enforces such strong correlations 
between the node parameters that it destroys dominance stripes. 

5. v: additive noise level used to corrupt each member of the training set. 
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k: wavenumber of sinusoids used in the training set. In describing the 
training sets the index y will be used to denote position in input space, thus 
position y in input space lies directly "under" node y of the network. In ID 
simulations each training vector is a sinusoid of the form sin (k?/ + c^) + r, 
where is a random phase angle, and r is a random number sampled 
uniformly from the interval [— f, f] - this generates an topology train- 
ing set (i.e. parameterised by 1 random angle). In 2D simulations each 
training vector is a sinusoid of the form sin (k {yi cos + j/2 sin 0) + 0) + r, 
where the additional angle is a random azimuthal orientation for the 
sine wave - this generates an x topology training set (i.e. parame- 
terised by 2 independent random angles). Note that nii and must be 
an integer multiple of 27r in order to ensure that the probability density 
around the subspace generated by (j) has uniform density (in effect, the 
then becomes a circular Lissajous figure, which therefore has uniform 
probability density, unlike non-circular Lissajous figures), and thus to en- 
sure that there are no artefacts induced by the periodicity of the training 
data that might mimic the effect of dominance stripes. If nii and K,i2 are 
much greater than 2?? then it is not necessary to fix them to be integer 
multiples of 27r - because the fluctuations in the probability density are 
then negligible. Note that this restriction on the value of k, would not have 
been necessary had complex exponentials been used rather than sinusoids. 

s: number of subspaces. This fixes the number of statistically independent 
subspaces in the training set. When s — 1 the training set is generated 
exactly as above. When s — 2 the training set is split up as follows. 
The ID case has even y in one subspace, and odd y in the other subspace, 
thus successive components of each training vector alternate between the 2 
subspaces. The 2D case has even yi+y2 in one subspace, and odd yi+y2 in 
the other subspace, thus each training vector is split up into a chessboard 
pattern of interlocking subspaces. This strategy readily generalises for 
s > 3, although this is not used here. Within each subspace the training 
vector is generated as above, and the subspaces are generated so that they 
are statistically independent. 

e: update parameter used in gradient descent. This is used to update 
parameters thus 

d{Di+D2) , , 

parameter — parameter — e — (35) 

aparameter 

There are 3 internally generated update parameter, which control the up- 
date of the 3 different types of parameter, i.e. the biases, the weights, and 
the reference vectors. This is necessary because these parameters all have 
different dimensionalities, and by inspection of equation [35] the dimen- 
sionality of an update parameter is the dimensionality of the parameter 
it updates (squared) divided by the dimensionality of the Euclidean dis- 
tortion. These 3 internal parameters are automatically adjusted to ensure 
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that the average change in absolute value of each of the 3 types of param- 
eter is equal to e times the typical diameter of the region of parameter 
space populated by the parameters. This adjustment is made anew as 
each training vector is presented. The size of e determines the "memory 
time" of the node parameters. This memory time determines the effective 
number of training vectors that the nodes are being optimised against, and 
thus must be sufficiently long (i.e. e sufficiently small) that if s > 2 it is 
possible to discern that the subspaces are indeed statistically independent. 
This is crucially important, for dominance stripes cannot be obtained if 
the subspaces are not sufficiently statistically independent. So e must be 
small, which unfortunately leads to correspondingly long training times. 

3.2 Initialisation 

The training set is globally translated and scaled so that the components of all 
of its training vectors lie in the interval [—1, +1]. There are 3 parameter types 
to initialise. The weights were all initialised to random numbers sampled from 
a uniform distribution in the interval [—0.1, +0.1], whereas the biasses and the 
reference vector components were all initialised to 0. Because the 2D simulations 
took a very long time to run, they were periodically interrupted and the state 
of all the variables written to an output file. The simulation could then be 
continued by reading this output file in again and simply continuing where the 
simulation left off. Alternatively, some of the variables might have their values 
changed before continuing. In particular, the random number generator could 
thus be manipulated to simulate the effect of a finite sized training set (i.e. use 
the same random number seed at the start of each part of the simulation), or 
an infinite-sized training set (i.e. use a different random number seed at the 
start of each part of the simulation). The size of the e parameter could also 
thus be manipulated should a large value be required initially, and reduced to 
a small value later on, as required in order to guarantee that when n > 2 the 
input subspaces are seen to be statistically independent, and dominance stripes 
may emerge. 

3.3 Boundary Conditions 

There are many ways to choose the boundary conditions. In the numerical 
simulations periodic boundary conditions will be avoided, because they can 
lead to artefacts in which the node parameters become topologically trapped. 
For instance, in a 2D simulation, periodic boundary conditions imply that the 
nodes sit on a 2-torus. Leakage implies that the node parameter values are 
similar for adjacent nodes, which limits the freedom for the parameters to adjust 
their values on the surface of the 2-torus. For instance, any acceptable set of 
parameters that sits on the 2-torus can be converted into another acceptable 
set by mapping the 2-torus to itself, so that each of its "coils up" an integer 
number of times onto itself. Such a multiply wrapped parameter configuration 
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is topologically trapped, and cannot be perturbed to its original form. This 
problem does not arise with non-periodic boundary conditions. 

There are several different problems that arise at the boundaries of the array 
of nodes: 

1. The neighbourhood function N (yi, 1/2) cannot be assumed to be a rectan- 
gular top-hat centred on (2/1,2/2)- Instead, it will simply be truncated so 
that it does not fall off the edge of array of nodes, i.e. N {y 1,112) =0 for 
those (2/1,2/2) that lie outside the array. 

2. The leakage function tt (2/1 — 2/1 : 2/2 — 2/2) '^il^ be similarly truncated. How- 
ever, in this case tt (2/1 — 2/1; 2/2 — 2/2) must normalise to unity when summed 
over (2/1, 2/2), so the effect of the truncation must be compensated by scal- 
ing the remaining elements of tt (2/1 —2/112/2—2/2)- 

3. The input window for each node implies that the input array must be 
larger than the node array in order that the input windows never fall off 
the edge of the input array. 

3.4 Presentation of Results 

The most important result is the emergence of dominance stripes. For n = 2 
there are thus 2 numbers that need to be displayed for each node: the "degree of 
attachment" to subspace 1, and similarly for subspace 2. There are many ways 
to measure degree of attachment, for instance the probability density Pr (x|y) 
gives a direct measurement of how strongly node y depends on the input vector 
X, so its "width" or "volume" in each of the subspaces could be used to measure 
degree of attachment. However, in the simulations presented here (i.e. sinu- 
soidal training vectors) the degree of attachment is measured as the average of 
the absolute values of the components of the reference vector in the subspace 
concerned. This measure tends to zero for complete detachment. For ID simu- 
lations 2 dominance plots can be overlaid to show the dominance of subspaces 

1 and 2 for each node. For 2D simulations it is simplest to present only 1 of 
these plots as a 2D array of grey-scale pixels, where the grey level indicates the 
dominance of subspace 1 (or, alternatively, subspace 2)1J 

3.5 ID Simulation 

The parameter values used were: (mi, 7712) = (1,100), (ii,i2) = (1741), {wi,W2) = 
(1,21), (/i,/2) = (1,15), K = 0.3, 1/ = 0.1, s = 2, n = 400, e = 0.002. This 
value of K implies ^ ~ 1.96, so n is approximately an integer multiple of 27r, 
as required for an artefact-free simulation. In figure 13.51 a plot of the 2 dom- 
inance curves obtained after 3200 training updates is shown. This dominance 
plot clearly shows alternating regions where subspace 1 dominates and subspace 

2 dominates. The width of the neighbourhood function is 21, which is the same 

^The results of the 2D simulations do not appear in this draft paper. 
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Figure 10: Dominance plots for a ID simulation with 2 statistically independent 
training set subspaces. 



the period of the variations in the dominance plots, i.e. within each set of ad- 
jacent 21 nodes half the nodes are attached to subspace 1 and half to subspace 
2. There are boundary effects, but these are unimportant. 



A Normalisation of Pr (y|x) 



The normalisation of the expression for Pr(y|x) in equation [5] may be demon- 
strated as follows: 



^Pr(y|x) 
y=i 



1 



y=i 



y'ew(y) 



y"eN{y') 



Q(x|y") 



y' = l yGJV(y') 
1 ™ 

M ^ 

y'=i 

m\mi ■ ■ ■ md 



Ey"eAr(y')Q(x|y") 



M 



= 1 



(36) 



In the first step the order of the y and the y' summations is interchanged using 

E"=i Ey'gAr(y) (•••) = E"=i EygAr(y') (• " ')> t^c sccoud step the numerator 
and denominator of the summand cancel out. 
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B Upper Bound for Multiple Firing Model 

It is possible to simplify equation [TU] by using the following identity 

^ 1 

X - x' (yi, y2, • ■ • y„) = - V (x - x' (y,)) + - V (x' (y,) - x' (yi, y2, • • • y„)) 

T). ^ ^ 71 ^ ^ 



(37) 

Note that this holds for all choices of x' (y.;). This allows the Euclidean distance 
to be expanded thus 

llx - x' (yi,y2, • ■■yn)f 
1 " 

= ^||x-x'(y„)||' 

1=1 

1 " 

+ ^ E (x-x'(y,))-(x-x'(y,)) (38) 

2 " 

E (x-x'(y.0)(x'(yj)-x'(yi,y2,---y„)) 

^(x'(y,0-x'(yi,y2,---yn)) (39) 
Each term of this expansion can be inserted into equation (TU] to yield 

1 f 2 

term 1 = — / dxPr (x) Pr (y|x) ||x — x' (y)|| 

„ m 

ydxPr(x) Pr(yi,y2|x)(x-x'(yi))(x-x'(y2)) 



term 2 = 

term 3 = 
term 4 = 



n — 1 



n 

-2 X term 4 



yi,y2=i 



E Pr(yi,y2,---y„) 
yi,y2,-y7i=i 



x' (yi,y2, • • - yn 
-^ELix' (yi 



(40) 



Pr (yi J y2 , ■ • • yn |x) has been assumed to be a symmetric function of (yi , y2 , • • • yn 
in the first two results, and the definition of x' (yi, y2, • • • yn) in equation [11] has 
been used to obtain the third result. These results allow D in equation [TU] to 
be expanded as D = Di + D2 — D3, where 



^1 = -J2jd^PT{^)PT{y\^)\\^^^'{y)f 

„ m 

J d^Fv{^) ^ Pr(yi,y2|x)(x-x'(yi))(x-x'(y2)) 



y=i' 

2(n-l) 



yi,y2=i 
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yi,y2,---y,i=i 



x' (yi,y2,---yn) 



(41) 



By noting that D3 > 0, an upper bound for D in the form D < Di+ D2 fohows 
immediately from these results. Note that Di > whereas D2 can have either 
sign. In the special case where Pr(yi,y2|x) — Pr (yi |x) Pr (yi |x) (i.e. yi and 
y2 are independent of each other given that x is known) D2 reduces to 



D2 



2{n-l) 



J dxPr (x) 



^Pr(y|x) (x-x'(y)) 
y=i 



(42) 



which is manifestly positive. This is the form of D2 that is used throughout this 
paper. 



C Derivatives of the Objective Function 

Di and D2 are as given in equation 1121 i.e. it is assumed that Pr(yi,y2|x) = 
Pr (y2|x) Pr (y2|x) and Pr (y|x) has the scalable form given in equation |51 De- 
fine a compact matrix notation as follows 

Ly^y, = Pr (y'|y) Py.y = Pr (y^x; y) 

Py = I]y'eJV(y) Py',y {^^P)y = Z]y'=i ^y'.yPy' 

dy = X - x' (y) (L d)y = E™=i ^y,y' dy' 

{PLd)^^j:reNiy)Py,y'iLd)^, {P^ P Ld) ^ = Y.ye^(y) Py' ,y {P Ld)^. (43) 

Cy = ||x - x' (y) f {L e)y = E™=i ^y,y' ey' 

(PLe)y ^Ey'eiV(y)^y,y' iLe\, (P^Pi e)^ ^ Ey'e^(y) ^y',y (^^ e)^, 

d ^ Er=i (^^P)y dy or d ^ E^i ^PP^)y 

Using this matrix notation, the functions fi (x, y), £2 (x, y), gi (x, y), and 52 (x, y) 
may be defined as 

fi(x,y) ^ (P^p)ydy (44) 

f2(x,y) ^ {L^p)yd (45) 

gi(x,y) ^ py{Le)^-{P^PLe)^ (46) 

g2(x,y) ^ (py(id)y-(P^Pid)J -d (47) 

The variation of Pr (y|x; y') is then given by 

5Pr(y|x;y) = (y|x; y ) _ ^^^^^^^^^^ ^ ^ ^^l^,, J 

= Pr(y|x;y') ^ 5 logQ (x|y") (V%y - (y"|x; /)) 
y"ew(y') 

= Py,,y J2 '51ogQ(x|y")(V',y-^y'.y") (48) 
y"eN{y') 



24 



In order to rearrange the expressions to ensure that only a single dummy index 
is required at every stage of evaluation of the sums it will be necessary to use 
the result 

m m 

E E =E E (49) 

y=iy'eiv(y) y'=iyejv(y') 

C.l Calculate ^5^. 

The derivative is given by 

7—^ 4 n 111 

a^ = -;af /'^^^'^^"^EP'^W) E Pr(y'|x;y")(x-x'(y)) 

y'=i y"eJV(y') 

(50) 

Use matrix notation to write this as 



y'=l y"eiV(y') 

Finally remove the explicit summations to obtain the required result 
dDi 4 



(51) 



ax'(y) nM 

4 



nM 



I d^Pv{^){L^p)^dy (52) 
J dxPr(x) fi(x,y) 



C.2 Calculate 

The derivative is given by 

dD2 _ 4 (n - 1) 
5x' (y) 



^|dxPr(x)(f;Pr(y|y') ^ Pr(y'|x;y")) 

\y' = l y"eJV(y') / 

mm \ 

EEP'^(yly') E Pr(y'|x;y")(x-x'(y)) (53) 

^^y=ly'=l y"eJV(y') / 



Use matrix notation to write this as 
dD2 4 (n - 1) 



ax'(y) nM2 



/'dxPr(x) I f;V,y ^ Py.,y,] 
Vy-i y"eJV(y') / 



EEV,y E ^y".y'dy) (54) 

^y=ly'=l y"eiV(y') 
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Finally remove the explicit summations to obtain the required result 

dD2 4(n- 1) /• , ^ , , , .,, , -i 
^ ,,\ = ^ / rfxPr X (L^p) d 

C.3 Calculate ^^^J^ 
The differential is given by 



^Di = ^Y.Id''^'i'')J2^'(y\y') E '5Pr(y'|x;y")||x-x'(y)||^ 
y=l y'=l y"eJV(y') 



Use matrix notation to write this as 

m ^ 

y=i"' 



(56) 



SD, = J-y f dxPr (x) f - ^y'=^ ^^'^"^ Sy"6^(y') ^y".y' 



nM V ^ Ey"'ejv(y") -^logQ (x|y"') (V",y' - ^^''-y'"^ 

(57) 

Reorder the summations to obtain 



5Di = 



2 . / E""=i'51ogQ(x|y'")Ey"eiV(y'") 

\ \ -^y",y"' Z^y'ejv(y") -^y",y' / ^ 



(58) 

Relabel the indices and evaluate the sum over the Kronecker delta to obtain 



m „ 

y=i"' 



5Di = -^yj dxPv (x) S log g (x|y) (59) 
y= 



^Ey'ew(y) -Py'.yj (^I]y'=i -^y,y' ^y' 
^ (^Ey"'e7V(y) -^y"':y Ey"eAr(y"') -Fy"',y" Ey'=i Ly",y' ^y' 
Finally remove the explicit summations to obtain the required result 
2 



6Di = 



nM 

y 



2 " 



nM 

y 



^ / rfx Pr (x) S log Q (x|y) (py (L e)^ - (P^P L e)y) 
y=i"' 

m „ 

^ / rfx Pr (x) ffi (x, y) 6 log Q (x|y) (60) 
y=i"' 



C.4 Calculate , , 

51ogQ(x|y) 
The differential is given by 
4(n- 1) 



J dxPr (x) 
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EEp^W) E Pr(y'|x;y")(x-x'(y)) 

yy=ly' = l y"6iV(y') 
m m 

EEP^(yly') E '5Pr(y'|x;y")(x-x'(y)) | (61) 

yy=ly' = l y"6iV(y') 

Use matrix notation to write this as 

= / dxPr (x) I > : > : > : p,..,,, d, i (62) 



nM2 



\y 

Reorder the summations to obtain 



//mm \ 
dxPr(x)K]^V,y ^ ^'y'.y'dy 

yy=ly'=l y"eJV(y') / 

I" ■^m T \^ T3 ^ 

sr^ ^y'=i -^y'>y Z^y"eN{y') y".y' ^ 

^ X Ey"'eiv(y") Clog's (^ly"') (V",y' - -Py",y"') ^ 



4(n-l) 



//mm \ 
dxPr(x) E ^y".y'dy 

\y=ly' = l y"eJV(y') / 

Ey"=i'51ogQ(x|y"')Ey"e.Y(y"') 
Ey'eiv(y") V",y'^y",y' \ ^ , 

V -^y",y"'Ey'eiv(y")^y".y' ) ^y=^ """^ '"^ '''' 



(63) 



Relabel the indices and evaluate the sum over the Kronecker delta to obtain 



5D2 = ^^^^ £ f rfx Pr (x) 5 log Q (x|y) 
y=i 



(Ey'ew(y) -Py'.y) (Ey'=i iy,y' dy') 
- (Ey"'ejv(y) -Py"',y Ey"ejv(y"') -Py"',y" Ey'=i -^y",y' dy') 

EEV.y E ^y.y'dy (64) 

^^y=ly'=l y"eiV(y') / 

Finally remove the explicit summations to obtain the required result 
5D2 = i^^^f;|dxPr(x)51ogQ(x|y) 

x(py(id)y-(P^PLd)j-d 



4 (n - 1) 



nM2 

y- 



m ^ 

y=i"' 



dx Pr (x) 52 (x, y) 6 log Q (x|y) (65) 
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D Expression for Di + D2 in Terms of x' (c) 

From equation [T2] Di + D2 can be written as 

D,+D, = - /dxPr(x)^Pr(y|x)(||x'(y)f -2x.x'(y) 
nj V 



2(n- 1) 



dxPr(x) 



E™iPr(y|x)x'(y) 



1(66) 



-constant 

where the constant terms do not depend on x' (y). However, from equation 1121 
the derivative ^'■P^t''^''-* can be written as 

ax'(y) 

d (Di + D2) 

n J 

( X - x' (y) ^ 

(67) 



ax' (y) 



-- / dxPr(x)Pr(y|x) 
n J 

X - x' (y) 
+ ("-l)Ey=iPr(y'|x) (x-x'(y')) 



Using Bayes' theorem the stationarity condition ^^g^^^^"^ = yields a matrix 
equation for the x' (y) 

nf dxPr(x|y) x = (n-l) ^ (J dxPr (x|y) Pr (y'lx)") x' (y') + x' (y) 

y'=i 



which may then be used to replace all instances of x in equation [55) This yields 
the result 



2 r ™ 

D,+D2 = — / dxPr(x) VPr(y|x)||x'(y) 

y=i 



2(n- 1) 
n 

+constant 



dx Pr (x) 



^Pr(y|x)x' (y) 
y=i 



(69) 



E Comparison of Di + D2 for Different Types of 
Optima 

In order to compare the value of Di + D2 that is obtained when different types 
of supposedly optimum configurations of the threshold functions Q (x|y) are 
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tried, the x' (y) that solves ^^g^^^^^'' = (see appendix [D| must be inserted 
into the expression for Di + D2. In the following derivations the constant term 
is omitted, and the definition Rm = / (ixPr(x|y) x (see equation I33|) has 
been used. 



E.l Type 1 Optimum: all the nodes are attached one sub- 
space 

In equation [18] Di + D2 becomes 



D1+D2 = -'- 



[ dxPr(x)^Pr(?;|x) (j dxi P 

y=l 



^i\y) xi,o 



2(n - 1) 



n 



dn. Pr (x) 



^Pr(y|x) /" dxiPr(xi|2;) xi,0 



_ 2 2(n-l) 

— — ^Af -Km 

n n 

= —'2'Rm 



(70) 



E.2 Type 2 Optimum: all the nodes are attached both 
subspaces 

In equation [18] Z?! + D2 becomes 



2 r 

D1+D2 = / dxPr(x)^Pr(y|x) 



j dxi Pr (xi ly) Xi , J dx2 Pr (x2 |y) X2 
2(n- 1) 



dxPr(x) 



^Pr(j/|x) (Jdxi Pr(xi|j/)xi, /'dx2Pr(x2|y)x2 

y=l 



2 2 (n - 1) 



2R, 



M 



-4i? 



'M 



(71) 



E.3 Type 3 Optimum: half the nodes are attached one 
subspace and half are attached to the other 

In equation [TS] Z?! + D2 becomes 



D,+D2 = -- I dxPr(x) 

n \n+\ J J 
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